Construction of a facsimile data set for large genome sequence analysis
Internal identifier: 004A40 (Main/Exploration); previous: 004A39; next: 004A41
Authors: Oliver Seely Jr. [United States]; Da-Fei Feng [United States]; Douglas W. Smith [United States]; Daniel Sulzbach [United States]; Russell F. Doolittle [United States]
Source:
- Genomics [0888-7543]; 1990.
English descriptors
- Teeft:
- Academic press, Amino acid sequences, Amino acids, Artificial introns, Base pairs, Base substitutions, Chicken collagen, Coding sequence, Coding sequences, Codon, Consensus sequence, Cosmid, Cray supercomputer, Current sequence collections, Current status, Diego supercomputer center, Direct comparison, Entry sequence, Exon, Exon length, Facsimile, Facsimile data, First step, Friezner degen, Genbank, Genbank documentation, Genbank entry, Genbank locus, Genbank release, Gene, Gene duplication, Gene products, General identification, Genome, Human genome, Human genome initiative, Intergenic sequences, Intron, Large amounts, Large genome, Last nucleotide positions, Lookup table, Macintosh, Macintosh computer, Mrna, Nucleic acids, Nucleotide, Peptide, Peptide sequences, Program gbprot, Prokaryotic, Prokaryotic sequences, Random intergenic sequences, Random sequences, Reading frame, Reading frames, Reasonable facsimile, Refset, Refset half, Sample cosmid, Search results, Seely, Sequence, Sequence data, Sequencing, Similar segments, Splice junctions, Supplementary information, Terminator codons, Testset, Unannotated genbank entries, Unknown sequence.
Abstract
A test was devised for exploring the question of whether it will be possible to identify genes in large-scale genome studies solely by sequence comparison with current sequence collections. To this end, a facsimile data set was constructed by dividing GenBank Release 56 randomly into two halves, one to serve as a reference set and the other intended to simulate raw data anticipated from large genome sequence projects. All supplementary information and identifying marks were removed from the test set after assignment of random identification numbers to each entry and their encryption. Because noncoding intervening sequences (introns) are underrepresented in GenBank, a program that introduced (simulated) introns into mRNA and prokaryotic sequences was devised. In a further attempt to make the problem of identification more realistic, random base substitutions and single-base deletions were also incorporated. The randomly ordered entries were concatenated, along with random intergenic flanking sequences, into a single long “chromosome” 33 Mb in length and then cut into “cosmids” 50–100 kb long. The chopping process was conducted in such a way that terminal overlaps would allow the order of the entries in the chromosome to be reconstituted. Finally, the sequences of a substantial fraction of the cosmids were converted to their complements. Preliminary searching of 10 test cosmids revealed that more than two-thirds of the entries in the test set should be readily identifiable by type of gene product solely on the basis of comparison with the reference set. These preliminary results suggest that existing computer regimens and sequence collections would be able to identify the majority of eukaryotic genes in any new raw data set, the existence of introns notwithstanding. Moreover, the analysis can be conducted in pace with the data collection so that the search results and summary identifications will be instantly available to the research community at large.
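The construction steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' program: every function name, the mutation rates, the intron length, the spacer length, and the overlap size are hypothetical values chosen for illustration. The GT...AG intron boundaries and the use of the reverse complement (rather than the plain complement) are also assumptions, since the abstract does not specify them.

```python
import random

# Lookup table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def random_dna(length, rng):
    """Return a random DNA string of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def split_entries(entries, rng):
    """Randomly divide entries into a reference half and a test half."""
    shuffled = list(entries)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def insert_intron(seq, rng, intron_len=60):
    """Insert one simulated intron at a random interior position.

    Assumption: the intron starts with GT and ends with AG.
    """
    if len(seq) < 2:
        return seq
    pos = rng.randrange(1, len(seq))
    intron = "GT" + random_dna(intron_len - 4, rng) + "AG"
    return seq[:pos] + intron + seq[pos:]


def mutate(seq, rng, sub_rate=0.01, del_rate=0.005):
    """Apply random base substitutions and single-base deletions."""
    out = []
    for base in seq:
        r = rng.random()
        if r < del_rate:
            continue  # single-base deletion
        if r < del_rate + sub_rate:
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)


def build_chromosome(entries, rng, spacer_len=100):
    """Concatenate entries, flanked by random intergenic spacers."""
    parts = [random_dna(spacer_len, rng)]
    for seq in entries:
        parts.append(seq)
        parts.append(random_dna(spacer_len, rng))
    return "".join(parts)


def chop(chromosome, rng, min_len=50_000, max_len=100_000, overlap=500):
    """Cut the chromosome into pieces whose ends overlap by `overlap` bases.

    Requires overlap < min_len; the final piece may be shorter than min_len.
    """
    cosmids = []
    start = 0
    while True:
        end = start + rng.randint(min_len, max_len)
        cosmids.append(chromosome[start:end])
        if end >= len(chromosome):
            break
        start = end - overlap  # next piece re-reads the last `overlap` bases
    return cosmids


def reverse_complement(seq):
    """Return the opposite strand, read in the conventional direction."""
    return seq.translate(COMPLEMENT)[::-1]


def flip_some(cosmids, rng, fraction=0.5):
    """Convert a random fraction of cosmids to the opposite strand."""
    return [reverse_complement(c) if rng.random() < fraction else c
            for c in cosmids]
```

Passing a seeded `random.Random` instance to each function makes a run reproducible. Because consecutive pieces share their terminal bases, the original order can be recovered by matching the tail of one piece to the head of the next, which mirrors the terminal-overlap property the abstract describes.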
DOI: 10.1016/0888-7543(90)90227-L
Links to previous steps (curation, corpus...)
- to stream Istex, to step Corpus: 002073
- to stream Istex, to step Curation: 002073
- to stream Istex, to step Checkpoint: 001F65
- to stream Main, to step Merge: 004B18
- to stream Main, to step Curation: 004A40
The document in XML format
<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title>Construction of a facsimile data set for large genome sequence analysis</title>
<author><name sortKey="Seely Jr, Oliver" sort="Seely Jr, Oliver" uniqKey="Seely Jr O" first="Oliver" last="Seely Jr.">Oliver Seely Jr.</name>
</author>
<author><name sortKey="Feng, Da Fei" sort="Feng, Da Fei" uniqKey="Feng D" first="Da-Fei" last="Feng">Da-Fei Feng</name>
</author>
<author><name sortKey="Smith, Douglas W" sort="Smith, Douglas W" uniqKey="Smith D" first="Douglas W." last="Smith">Douglas W. Smith</name>
</author>
<author><name sortKey="Sulzbach, Daniel" sort="Sulzbach, Daniel" uniqKey="Sulzbach D" first="Daniel" last="Sulzbach">Daniel Sulzbach</name>
</author>
<author><name sortKey="Doolittle, Russell F" sort="Doolittle, Russell F" uniqKey="Doolittle R" first="Russell F." last="Doolittle">Russell F. Doolittle</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:B7439848BFC50D16530B081813FB2D99A18BB6B6</idno>
<date when="1990" year="1990">1990</date>
<idno type="doi">10.1016/0888-7543(90)90227-L</idno>
<idno type="url">https://api.istex.fr/ark:/67375/6H6-H8LNX0WC-9/fulltext.pdf</idno>
<idno type="wicri:Area/Istex/Corpus">002073</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Corpus" wicri:corpus="ISTEX">002073</idno>
<idno type="wicri:Area/Istex/Curation">002073</idno>
<idno type="wicri:Area/Istex/Checkpoint">001F65</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Checkpoint">001F65</idno>
<idno type="wicri:doubleKey">0888-7543:1990:Seely Jr O:construction:of:a</idno>
<idno type="wicri:Area/Main/Merge">004B18</idno>
<idno type="wicri:Area/Main/Curation">004A40</idno>
<idno type="wicri:Area/Main/Exploration">004A40</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a">Construction of a facsimile data set for large genome sequence analysis</title>
<author><name sortKey="Seely Jr, Oliver" sort="Seely Jr, Oliver" uniqKey="Seely Jr O" first="Oliver" last="Seely Jr.">Oliver Seely Jr.</name>
<affiliation wicri:level="2"><country xml:lang="fr">États-Unis</country>
<placeName><region type="state">Californie</region>
</placeName>
<wicri:cityArea>Center for Molecular Genetics, University of California at San Diego, La Jolla</wicri:cityArea>
</affiliation>
<affiliation wicri:level="2"><country xml:lang="fr">États-Unis</country>
<placeName><region type="state">Californie</region>
</placeName>
<wicri:cityArea>San Diego Supercomputer Center, University of California at San Diego, La Jolla</wicri:cityArea>
</affiliation>
<affiliation wicri:level="2"><country xml:lang="fr">États-Unis</country>
<placeName><region type="state">Californie</region>
</placeName>
<wicri:cityArea>1 Permanent address: Department of Chemistry, California State University, Dominguez Hills, Carson</wicri:cityArea>
</affiliation>
</author>
<author><name sortKey="Feng, Da Fei" sort="Feng, Da Fei" uniqKey="Feng D" first="Da-Fei" last="Feng">Da-Fei Feng</name>
<affiliation wicri:level="2"><country xml:lang="fr">États-Unis</country>
<placeName><region type="state">Californie</region>
</placeName>
<wicri:cityArea>Center for Molecular Genetics, University of California at San Diego, La Jolla</wicri:cityArea>
</affiliation>
<affiliation wicri:level="2"><country xml:lang="fr">États-Unis</country>
<placeName><region type="state">Californie</region>
</placeName>
<wicri:cityArea>San Diego Supercomputer Center, University of California at San Diego, La Jolla</wicri:cityArea>
</affiliation>
</author>
<author><name sortKey="Smith, Douglas W" sort="Smith, Douglas W" uniqKey="Smith D" first="Douglas W." last="Smith">Douglas W. Smith</name>
<affiliation wicri:level="2"><country xml:lang="fr">États-Unis</country>
<placeName><region type="state">Californie</region>
</placeName>
<wicri:cityArea>Center for Molecular Genetics, University of California at San Diego, La Jolla</wicri:cityArea>
</affiliation>
<affiliation wicri:level="2"><country xml:lang="fr">États-Unis</country>
<placeName><region type="state">Californie</region>
</placeName>
<wicri:cityArea>San Diego Supercomputer Center, University of California at San Diego, La Jolla</wicri:cityArea>
</affiliation>
</author>
<author><name sortKey="Sulzbach, Daniel" sort="Sulzbach, Daniel" uniqKey="Sulzbach D" first="Daniel" last="Sulzbach">Daniel Sulzbach</name>
<affiliation wicri:level="2"><country xml:lang="fr">États-Unis</country>
<placeName><region type="state">Californie</region>
</placeName>
<wicri:cityArea>Center for Molecular Genetics, University of California at San Diego, La Jolla</wicri:cityArea>
</affiliation>
<affiliation wicri:level="2"><country xml:lang="fr">États-Unis</country>
<placeName><region type="state">Californie</region>
</placeName>
<wicri:cityArea>San Diego Supercomputer Center, University of California at San Diego, La Jolla</wicri:cityArea>
</affiliation>
</author>
<author><name sortKey="Doolittle, Russell F" sort="Doolittle, Russell F" uniqKey="Doolittle R" first="Russell F." last="Doolittle">Russell F. Doolittle</name>
<affiliation></affiliation>
<affiliation wicri:level="2"><country xml:lang="fr">États-Unis</country>
<placeName><region type="state">Californie</region>
</placeName>
<wicri:cityArea>Center for Molecular Genetics, University of California at San Diego, La Jolla</wicri:cityArea>
</affiliation>
<affiliation wicri:level="2"><country xml:lang="fr">États-Unis</country>
<placeName><region type="state">Californie</region>
</placeName>
<wicri:cityArea>San Diego Supercomputer Center, University of California at San Diego, La Jolla</wicri:cityArea>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="j">Genomics</title>
<title level="j" type="abbrev">YGENO</title>
<idno type="ISSN">0888-7543</idno>
<imprint><publisher>ELSEVIER</publisher>
<date type="published" when="1990">1990</date>
<biblScope unit="volume">8</biblScope>
<biblScope unit="issue">1</biblScope>
<biblScope unit="page" from="71">71</biblScope>
<biblScope unit="page" to="82">82</biblScope>
</imprint>
<idno type="ISSN">0888-7543</idno>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0888-7543</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="Teeft" xml:lang="en"><term>Academic press</term>
<term>Amino acid sequences</term>
<term>Amino acids</term>
<term>Artificial introns</term>
<term>Base pairs</term>
<term>Base substitutions</term>
<term>Chicken collagen</term>
<term>Coding sequence</term>
<term>Coding sequences</term>
<term>Codon</term>
<term>Consensus sequence</term>
<term>Cosmid</term>
<term>Cray supercomputer</term>
<term>Current sequence collections</term>
<term>Current status</term>
<term>Diego supercomputer center</term>
<term>Direct comparison</term>
<term>Entry sequence</term>
<term>Exon</term>
<term>Exon length</term>
<term>Facsimile</term>
<term>Facsimile data</term>
<term>First step</term>
<term>Friezner degen</term>
<term>Genbank</term>
<term>Genbank documentation</term>
<term>Genbank entry</term>
<term>Genbank locus</term>
<term>Genbank release</term>
<term>Gene</term>
<term>Gene duplication</term>
<term>Gene products</term>
<term>General identification</term>
<term>Genome</term>
<term>Human genome</term>
<term>Human genome initiative</term>
<term>Intergenic sequences</term>
<term>Intron</term>
<term>Large amounts</term>
<term>Large genome</term>
<term>Last nucleotide positions</term>
<term>Lookup table</term>
<term>Macintosh</term>
<term>Macintosh computer</term>
<term>Mrna</term>
<term>Nucleic acids</term>
<term>Nucleotide</term>
<term>Peptide</term>
<term>Peptide sequences</term>
<term>Program gbprot</term>
<term>Prokaryotic</term>
<term>Prokaryotic sequences</term>
<term>Random intergenic sequences</term>
<term>Random sequences</term>
<term>Reading frame</term>
<term>Reading frames</term>
<term>Reasonable facsimile</term>
<term>Refset</term>
<term>Refset half</term>
<term>Sample cosmid</term>
<term>Search results</term>
<term>Seely</term>
<term>Sequence</term>
<term>Sequence data</term>
<term>Sequencing</term>
<term>Similar segments</term>
<term>Splice junctions</term>
<term>Supplementary information</term>
<term>Terminator codons</term>
<term>Testset</term>
<term>Unannotated genbank entries</term>
<term>Unknown sequence</term>
</keywords>
</textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: A test was devised for exploring the question of whether it will be possible to identify genes in large-scale genome studies solely by sequence comparison with current sequence collections. To this end, a facsimile data set was constructed by dividing GenBank Release 56 randomly into two halves, one to serve as a reference set and the other intended to simulate raw data anticipated from large genome sequence projects. All supplementary information and identifying marks were removed from the test set after assignment of random identification numbers to each entry and their encryption. Because noncoding intervening sequences (introns) are underrepresented in GenBank, a program that introduced (simulated) introns into mRNA and prokaryotic sequences was devised. In a further attempt to make the problem of identification more realistic, random base substitutions and single-base deletions were also incorporated. The randomly ordered entries were concatenated, along with random intergenic flanking sequences, into a single long “chromosome” 33 Mb in length and then cut into “cosmids” 50–100 kb long. The chopping process was conducted in such a way that terminal overlaps would allow the order of the entries in the chromosome to be reconstituted. Finally, the sequences of a substantial fraction of the cosmids were converted to their complements. Preliminary searching of 10 test cosmids revealed that more than two-thirds of the entries in the test set should be readily identifiable by type of gene product solely on the basis of comparison with the reference set. These preliminary results suggest that existing computer regimens and sequence collections would be able to identify the majority of eukaryotic genes in any new raw data set, the existence of introns notwithstanding. Moreover, the analysis can be conducted in pace with the data collection so that the search results and summary identifications will be instantly available to the research community at large.</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
<region><li>Californie</li>
</region>
</list>
<tree><country name="États-Unis"><region name="Californie"><name sortKey="Seely Jr, Oliver" sort="Seely Jr, Oliver" uniqKey="Seely Jr O" first="Oliver" last="Seely Jr.">Oliver Seely Jr.</name>
</region>
<name sortKey="Doolittle, Russell F" sort="Doolittle, Russell F" uniqKey="Doolittle R" first="Russell F." last="Doolittle">Russell F. Doolittle</name>
<name sortKey="Doolittle, Russell F" sort="Doolittle, Russell F" uniqKey="Doolittle R" first="Russell F." last="Doolittle">Russell F. Doolittle</name>
<name sortKey="Feng, Da Fei" sort="Feng, Da Fei" uniqKey="Feng D" first="Da-Fei" last="Feng">Da-Fei Feng</name>
<name sortKey="Feng, Da Fei" sort="Feng, Da Fei" uniqKey="Feng D" first="Da-Fei" last="Feng">Da-Fei Feng</name>
<name sortKey="Seely Jr, Oliver" sort="Seely Jr, Oliver" uniqKey="Seely Jr O" first="Oliver" last="Seely Jr.">Oliver Seely Jr.</name>
<name sortKey="Seely Jr, Oliver" sort="Seely Jr, Oliver" uniqKey="Seely Jr O" first="Oliver" last="Seely Jr.">Oliver Seely Jr.</name>
<name sortKey="Smith, Douglas W" sort="Smith, Douglas W" uniqKey="Smith D" first="Douglas W." last="Smith">Douglas W. Smith</name>
<name sortKey="Smith, Douglas W" sort="Smith, Douglas W" uniqKey="Smith D" first="Douglas W." last="Smith">Douglas W. Smith</name>
<name sortKey="Sulzbach, Daniel" sort="Sulzbach, Daniel" uniqKey="Sulzbach D" first="Daniel" last="Sulzbach">Daniel Sulzbach</name>
<name sortKey="Sulzbach, Daniel" sort="Sulzbach, Daniel" uniqKey="Sulzbach D" first="Daniel" last="Sulzbach">Daniel Sulzbach</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 004A40 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 004A40 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Sante |area= MersV1 |flux= Main |étape= Exploration |type= RBID |clé= ISTEX:B7439848BFC50D16530B081813FB2D99A18BB6B6 |texte= Construction of a facsimile data set for large genome sequence analysis }}
This area was generated with Dilib version V0.6.33.